Thanks to Eric Anderson for lesson ideas and code!
Install the necessary packages:
install.packages(c("ggplot2", "lubridate", "plyr", "mosaic", "mosaicData", "reshape2", "png"))
Pull the most recent version of the rep-res-course repo just before coming to class.
* `ggplot2`'s underlying philosophy and how to work with it.
* `ggplot2`
* The `reshape2` package for converting between wide and long formats
* The `mosaic` package to experiment with different plots

`ggplot` and `ggplot2` are similar. `ggplot2` is just more recent (and recommended).

* `ggplot` operates on data frames.
* In `ggplot`, everything you want to use in a graphic must be contained within a data frame.

We will try to touch on everything but coords today.
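Because ggplot only reads from data frames, loose vectors have to be bundled into one before plotting. The sketch below uses made-up vectors (`height_m` and `year` are hypothetical, not part of the lesson data) and the `ggplot()` syntax that the rest of the lesson explains.

```r
library(ggplot2)

# Hypothetical loose vectors (not from the lesson data)
height_m <- c(4.02, 4.09, 4.12)
year     <- c(1912, 1920, 1922)

# ggplot reads from a data frame, so bundle the vectors first
records <- data.frame(Year = year, Meters = height_m)

p <- ggplot(data = records, mapping = aes(x = Year, y = Meters)) +
  geom_point()
p
```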
Without getting into the complications of scales and coordinate systems, here, in a nutshell, is what ggplot does:
Phew! That is a crazy mouthful. Is this really going to help us make pretty plots?
All I can say is you owe it to yourself to persevere — ggplot2 is really worth the effort!
Here we read them in:
library(ggplot2)
library(lubridate)
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
##
## here
# first off, read the data into a data frame
pv <- read.table("https://ericcrandall.github.io/BIO444/lessons/ggplot/data/mens_pole_vault_clean.txt",
                 sep = "\t",
                 header = TRUE,
                 stringsAsFactors = FALSE
)
# change Date to a true date format
pv$Date <- ymd(pv$Date)
# View(pv)
Here are the first few rows of our data frame:
head(pv)
## Meters Athlete Nation Venue Date X
## 1 4.02 Marc Wright United States Cambridge, U.S. 1912-06-08 1
## 2 4.09 Frank Foss United States Antwerp, Belgium 1920-08-20 1
## 3 4.12 Charles Hoff Norway Copenhagen, Denmark 1922-09-22 1
## 4 4.21 Charles Hoff Norway Copenhagen, Denmark 1923-07-22 2
## 5 4.23 Charles Hoff Norway Oslo, Norway 1925-08-13 3
## 6 4.25 Charles Hoff Norway Turku, Finland 1925-09-27 4
`ggplot2` has a function called `qplot` that behaves more like R’s base graphics function `plot()`.
I recommend not using `qplot`. It will just lengthen the time it takes to understand the grammar of graphics.
We will stick to the `ggplot()` standard syntax.
First we have to essentially establish a plotting area upon which to add layers. We will do this like so:
g <- ggplot()
At this point, g is a ggplot plot object. We can try printing it:
g
Adding a layer is done by adding a collection of geometric objects to it using one of the geom_xxxx functions. Each such function requires a data set and a mapping of columns in the data set to aesthetics. Let’s make some scatter-points: Meters as a function of Date:
g2 <- g + geom_point(data = pv, mapping = aes(x = Date, y = Meters))
g2
Here are some interesting points about:
```r
g2 <- g + geom_point(data = pv, mapping = aes(x = Date, y = Meters))
g2
```
a. You add layers by catenating them with `+`.
b. The names of the columns don't need to be quoted.
c. When you map aesthetics, you wrap them inside the `aes()` function.
d. The full object with all the layers is returned into `g2` and then we printed it (by typing `g2`).
   (We could have also just said `g + geom_point(data = pv, mapping = aes(x = Date, y = Meters))`.)
e. We didn't have to do anything fancy to the dates; ggplot knew how to plot them.
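To make the "layers are added with `+`" point concrete, here is a minimal sketch on a made-up data frame (`toy` is hypothetical and only stands in for the pole vault records). It counts the layers stored in each plot object.

```r
library(ggplot2)

# Hypothetical toy data, standing in for the pole vault records
toy <- data.frame(
  Date   = as.Date(c("1912-06-08", "1920-08-20", "1922-09-22")),
  Meters = c(4.02, 4.09, 4.12)
)

base        <- ggplot()   # empty plot object, no layers yet
with_points <- base + geom_point(data = toy, mapping = aes(x = Date, y = Meters))

# `+` returns a new object; the object on the left is not modified
length(base$layers)
length(with_points$layers)
```

Because `base` is left untouched, the same starting object can be reused to try out different layers.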
I want to overlay a line on that…No problem! Add another layer:
g3 <- g2 + geom_line(data = pv, mapping = aes(x = Date, y = Meters))
g3
We just added (using the `+` sign!) another layer, one that had a line on it. BUT! What if I want to make that line blue?
Make the line blue. Note that you are giving the line an aesthetic property (the color blue), but you are not mapping that to any values in the data frame, so you don’t put that within the `aes()` function:
g4 <- g2 + geom_line(data = pv, mapping = aes(x = Date, y = Meters), color = "blue")
g4
That worked! Notice that we were able to put that new layer atop g2 which we had stored previously.
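The difference between setting a color and mapping it is easy to get wrong, so here is a small sketch on made-up data (`toy` is hypothetical). It uses `layer_data()` from ggplot2 to look at the colour each layer ends up with.

```r
library(ggplot2)

toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))  # hypothetical data

# Setting: a constant outside aes() paints the line blue
set_blue <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(color = "blue")

# Mapping: inside aes(), "blue" is treated as data (a one-level category),
# so ggplot picks a colour from its default palette and adds a legend
mapped_blue <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(aes(color = "blue"))

# Colours actually used by each layer
unique(layer_data(set_blue)$colour)
unique(layer_data(mapped_blue)$colour)
```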
It is tedious to keep typing `data = pv, mapping = aes(x = Date, y = Meters)`; isn’t there some way around that?
Yes! You can pass a default data frame and/or default mappings to the original `ggplot()` function. Then, if data and mappings are not specified in later layers, the defaults are used.
d <- ggplot(data = pv, aes(x = Date, y = Meters)) # this defines defaults
d2 <- d + geom_point() # add a layer with points
d2 # print it
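As a sketch of how defaults and per-layer arguments interact (using a made-up `toy` data frame and a hypothetical `highlight` subset), the line layer below inherits everything from `ggplot()`, while the point layer swaps in its own data and keeps the inherited mapping.

```r
library(ggplot2)

toy <- data.frame(x = 1:6, y = c(1, 3, 2, 5, 4, 6))    # hypothetical data
highlight <- toy[toy$y >= 5, ]                          # subset to emphasise

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_line() +                                          # inherits data and mapping
  geom_point(data = highlight, color = "red", size = 3)  # overrides the data only
p

nrow(layer_data(p, 1))   # the line layer uses all rows of toy
nrow(layer_data(p, 2))   # the point layer uses only the highlighted rows
```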
Defaults set this way can still be overridden inside an individual `geom_xxx()` function.
Let’s go totally crazy!
Establish plot base with defaults:
d <- ggplot(data = pv, aes(x = Date, y = Meters))
Add a transparent turquoise area along the back:
d2 <- d + geom_ribbon(aes(ymax = Meters), ymin = min(pv$Meters), alpha = 0.4, fill = "turquoise")
d2
Wow! Transparency is not so easy in R base graphics!
Put a line along there too:
d3 <- d2 + geom_line(color = "blue")
d3
Now add some small orange points:
d4 <- d3 + geom_point(color = "orange")
d4
Now, add “rugs” along the \(x\) and \(y\) axes that show the position of points, and in them, map color to Nation:
d5 <- d4 + geom_rug(sides = "bl", mapping = aes(color = Nation))
d5
Note that we used the `aes()` function to add the mapping of color to Nation within the `geom_rug()` function.
Most geoms take `x`, `y`, and `color` (or `fill`). And some have more (or other) aesthetics you can map values to.

In the past you have probably encountered data in two different formats, wide and long, without ever really thinking about it. Wide data are sometimes easier for humans to grok. Long data are easier to use with statistical packages, and also make more sense at a basic level: each row is an observation.

### Grand Rapids, MI snow data 1893-2011

* We will explore this using the snowfall data from Grand Rapids, Michigan that is in the `mosaicData` package.
* Here we will print out a small part of it:
```r
library(mosaicData)
dim(SnowGR) # how many rows and columns?
```
```
## [1] 119 15
```
```r
head(SnowGR) # have a look at the first few rows
```
```
## SeasonStart SeasonEnd Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May
## 1 1893 1894 0 0 0 0.0 8.0 24.9 12.5 6.8 4.8 2 0
## 2 1894 1895 0 0 0 0.0 7.5 5.3 21.5 8.0 22.5 0 0
## 3 1895 1896 0 0 0 0.4 23.2 15.0 NA 8.5 2.0 0 0
## 4 1896 1897 0 0 0 0.2 8.0 8.0 4.9 11.2 12.0 0 0
## 5 1897 1898 0 0 0 0.0 1.4 8.0 15.5 29.5 0.0 0 0
## 6 1898 1899 0 0 0 0.0 18.5 18.0 20.0 3.4 16.0 0 0
## Jun Total
## 1 0 59.0
## 2 0 64.8
## 3 0 49.1
## 4 0 44.3
## 5 0 54.4
## 6 0 75.9
```
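`SnowGR` is in wide format. However, let us drop the “Total” column from it first because that is just computed from the other columns and is not “directly measured”.
Snow <- SnowGR[, -ncol(SnowGR)]
And the Values occupy most of the table.
`reshape2` is perhaps the nicest utility for converting between long and wide format (of course).
We will focus on its `melt` function, which converts from wide to long format.
`melt` takes a few arguments. The most important are `data` (the wide data frame), `id.vars` (the columns that identify each row), `variable.name` (the name of the new column that holds the old column names), and `value.name` (the name of the new column that holds the values).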
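Before turning to the snow data, a toy round trip may help. This sketch uses a made-up two-row table (`wide` is hypothetical) and also shows `dcast()`, the `reshape2` function for going from long back to wide, which the lesson itself does not use.

```r
library(reshape2)

# Hypothetical wide table: one row per season, one column per month
wide <- data.frame(
  Season = c(2001, 2002),
  Dec    = c(10.5, 3.0),
  Jan    = c(20.0, 7.5)
)

# Wide -> long: every Season/Month combination becomes its own row
long <- melt(wide,
             id.vars       = "Season",
             variable.name = "Month",
             value.name    = "Snowfall")
long

# Long -> wide again with dcast(): rows ~ columns
back <- dcast(long, Season ~ Month, value.var = "Snowfall")
back
```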
Let’s see it in action:
library(reshape2)
longSnow <- melt(data = Snow,
                 id.vars = c("SeasonStart", "SeasonEnd"),
                 variable.name = "Month",
                 value.name = "Snowfall"
)
head(longSnow) # see what it looks like
## SeasonStart SeasonEnd Month Snowfall
## 1 1893 1894 Jul 0
## 2 1894 1895 Jul 0
## 3 1895 1896 Jul 0
## 4 1896 1897 Jul 0
## 5 1897 1898 Jul 0
## 6 1898 1899 Jul 0
Note that you have to quote the column names (unlike in ggplot!)

## Plotting some snowfall
We are going to make some plots to underscore some particular points about ggplot.
Let’s make simple boxplots summarizing snowfall for each month over all the years. We want Month on the \(x\)-axis and we will map Snowfall to the \(y\)-axis.
g <- ggplot(data = longSnow, mapping = aes(x = Month, y = Snowfall))
g + geom_boxplot()
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
The months appear in that order (July first) because, by default, `melt` made a factor of the Month column and ordered the levels as they appeared in the data frame:
class(longSnow$Month)
## [1] "factor"
levels(longSnow$Month)
## [1] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" "Jan" "Feb" "Mar" "Apr" "May"
## [12] "Jun"
When `ggplot` applies colors to things, it uses discrete values if the values are discrete (aka factors or characters) and continuous gradients if they are numeric.
We can plot points, coloring them by SeasonStart (which is numeric…) instead of making boxplots:
g + geom_point(aes(color = SeasonStart))
## Warning: Removed 7 rows containing missing values (geom_point).
If we made a factor out of SeasonStart, then each SeasonStart gets its own color (far more than would be reasonable to visualize):
g + geom_point(aes(color = factor(SeasonStart)))
## Warning: Removed 7 rows containing missing values (geom_point).
So that is a little silly.
The points also pile on top of one another within each month. Spreading them out is easily achieved with `geom_jitter()` instead of `geom_point()`:
g + geom_jitter(aes(color = SeasonStart))
## Warning: Removed 7 rows containing missing values (geom_point).
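By default `geom_jitter()` nudges points in both directions. A possible refinement, sketched here on made-up data (`toy` is hypothetical), is to jitter only sideways with `width` and `height`, so the plotted snowfall values stay exactly where the data put them.

```r
library(ggplot2)

# Hypothetical stand-in for the long snowfall table
toy <- data.frame(
  Month    = factor(rep(c("Dec", "Jan"), each = 20), levels = c("Dec", "Jan")),
  Snowfall = c(rep(0, 10), seq(1, 10), rep(5, 10), seq(11, 20))
)

set.seed(1)  # jitter is random; a seed makes the figure reproducible
j <- ggplot(toy, aes(x = Month, y = Snowfall)) +
  geom_jitter(width = 0.25, height = 0, alpha = 0.5)  # sideways jitter only
j
```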
Let’s consider looking at a histogram of Snowfall:
d <- ggplot(data = longSnow, aes(x = Snowfall)) + geom_histogram(binwidth = 2, fill = "blue")
d
## Warning: Removed 7 rows containing non-finite values (stat_bin).
We can add a `facet_wrap` specification. We are “faceting” in one long strip by month but then “wrapping” it to fit on the page in 4 columns:
d + facet_wrap(~ Month, ncol = 4)
## Warning: Removed 7 rows containing non-finite values (stat_bin).
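When one panel has far taller bars than the others, the shared y-axis can flatten the rest. One option, sketched on made-up data (`toy` is hypothetical), is the `scales = "free_y"` argument of `facet_wrap()`, which gives each panel its own y-axis.

```r
library(ggplot2)

# Hypothetical data: one month that is all zeros, one with real snowfall
toy <- data.frame(
  Month    = rep(c("Jul", "Jan"), each = 50),
  Snowfall = c(rep(0, 50), seq(0.5, 25, by = 0.5))
)

f <- ggplot(toy, aes(x = Snowfall)) +
  geom_histogram(binwidth = 2, fill = "blue") +
  facet_wrap(~ Month, ncol = 2, scales = "free_y")  # each panel gets its own y-axis
f
```

The trade-off is that bar heights are no longer directly comparable across panels, so free scales suit shape comparisons better than count comparisons.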
There is also `facet_grid`. In fact that is where the `~` sort of specification comes from.
To use it, let’s first cut the seasons into three intervals:
# make intervals
longSnow$Interval <- cut_interval(longSnow$SeasonStart, n = 3, dig.lab = 4)
# here is what they are:
levels(longSnow$Interval)
## [1] "[1893,1932]" "(1932,1972]" "(1972,2011]"
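`cut_interval()` makes bins that span equal ranges, which is not the same as bins holding equal numbers of rows. The sketch below contrasts it with `cut_number()`, also from ggplot2, on a deliberately uneven made-up vector (`years` is hypothetical).

```r
library(ggplot2)

years <- c(1893, 1894, 1895, 1896, 1950, 2011)  # hypothetical, deliberately uneven

equal_width  <- cut_interval(years, n = 3, dig.lab = 4)  # same span of years per bin
equal_counts <- cut_number(years, n = 3, dig.lab = 4)    # same number of values per bin

table(equal_width)
table(equal_counts)
```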
Now, we are really just interested in the winterish months: November, December, January, February and March. How do we subset the data to just include these months? R has an operator called `%in%`. It is built on `match()`, but it returns a logical vector (TRUE wherever an element on the left appears anywhere in the vector on the right), which is often more intuitive to use. Where `==` tests against a single value, `%in%` tests against a whole vector of possible matches.
# get just the months we are interested in
winterMonths <- c("Nov", "Dec", "Jan", "Feb", "Mar")
winterSnow <- longSnow[longSnow$Month %in% winterMonths, ]
levels(winterSnow$Month)
## [1] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" "Jan" "Feb" "Mar" "Apr" "May"
## [12] "Jun"
Notice that the levels of the summer months are still part of the factor! `droplevels()` removes the unused ones:
winterSnow <- droplevels(winterSnow)
levels(winterSnow$Month)
## [1] "Nov" "Dec" "Jan" "Feb" "Mar"
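The subsetting steps above can be tried on a tiny made-up factor (`months` and `winter` are hypothetical) to see how `==`, `%in%`, `match()`, and `droplevels()` each behave. This sketch needs only base R.

```r
months <- factor(c("Jul", "Nov", "Dec", "Jan", "Nov"),
                 levels = c("Jul", "Nov", "Dec", "Jan"))  # hypothetical data
winter <- c("Nov", "Dec", "Jan")

months == "Nov"           # one value tested against each element
months %in% winter        # each element tested against a whole set
match(months, winter)     # positions in `winter`, NA where absent

kept <- months[months %in% winter]
levels(kept)              # "Jul" is still a level, although no element uses it
levels(droplevels(kept))  # unused level removed, original order preserved
```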
Now we can plot that with a facet_grid:
w <- ggplot(winterSnow, aes(x = Snowfall, fill = Interval)) +
  geom_histogram(binwidth = 2) +
  facet_grid(Interval ~ Month)
w
## Warning: Removed 4 rows containing non-finite values (stat_bin).
How about a different kind of plot that is like a continuous boxplot? Just for fun.
y <- ggplot(winterSnow, aes(x = Month, y = Snowfall, fill = Interval))
y + geom_violin() + facet_wrap(~ Interval, ncol = 1)
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
Every `geom_xxx()` function is associated with a default `stat_xxx()` function and vice versa.
Getting at the data that a stat computes is not entirely trivial, but can be done by looking at the output of `ggplot_build()`:
boing <- ggplot_build(d) #let's look at our simple histogram
## Warning: Removed 7 rows containing non-finite values (stat_bin).
head(boing$data[[1]])
## y count x xmin xmax density ncount ndensity PANEL group ymin
## 1 797 797 0 -1 1 0.28043631 1.00000000 2842.0000 1 -1 0
## 2 71 71 2 1 3 0.02498241 0.08908407 253.1769 1 -1 0
## 3 71 71 4 3 5 0.02498241 0.08908407 253.1769 1 -1 0
## 4 73 73 6 5 7 0.02568614 0.09159348 260.3087 1 -1 0
## 5 76 76 8 7 9 0.02674173 0.09535759 271.0063 1 -1 0
## 6 60 60 10 9 11 0.02111189 0.07528231 213.9523 1 -1 0
## ymax colour fill size linetype alpha
## 1 797 NA blue 0.5 1 NA
## 2 71 NA blue 0.5 1 NA
## 3 71 NA blue 0.5 1 NA
## 4 73 NA blue 0.5 1 NA
## 5 76 NA blue 0.5 1 NA
## 6 60 NA blue 0.5 1 NA
This can be a fun exercise to get your head around what sort of data are produced by each of the stats (and there are quite a few of them; see http://docs.ggplot2.org/current/).
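A shortcut for the same inspection is `layer_data()`, which ggplot2 provides as a wrapper around `ggplot_build()`. The sketch below runs it on a made-up vector (`x` is hypothetical) and checks that the bin counts add up to the number of values.

```r
library(ggplot2)

x <- c(0, 0, 1, 2, 2, 3, 7, 8, 8, 9)          # hypothetical values
h <- ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(binwidth = 2)

bins <- layer_data(h, 1)                       # same table as ggplot_build(h)$data[[1]]
bins[, c("xmin", "xmax", "count")]

sum(bins$count) == length(x)                   # every value lands in exactly one bin
```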
We can add stats to our plots directly as well. Let’s have a look at houses in Saratoga, New York.
m <- ggplot(data=SaratogaHouses, aes(x = age, y = price))
m1 <- m + geom_point()
m1 + stat_smooth(method = lm)
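To see what `stat_smooth(method = lm)` is fitting, one can fit the same model with `lm()` and overlay it. This sketch uses `mtcars`, which ships with R, as a stand-in for the housing data; `se = FALSE` and the explicit `formula` are choices made here, not part of the lesson's call.

```r
library(ggplot2)

# mtcars ships with base R; it stands in for SaratogaHouses here
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                       # intercept and slope of the fitted line

s <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # line drawn by ggplot
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              linetype = "dashed", color = "red")             # same line from lm()
s
```

The dashed red line from `lm()` should sit on top of the smooth, which suggests the stat is running an ordinary least-squares fit of y on x.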
Now let’s play a little with this data:
# let's fade the points a little
m2 <- m + geom_point(color = "grey60") + stat_smooth(method = lm)
m2
# size the points by living area
m2 <- m + geom_point(color = "grey60", mapping = aes(size = livingArea)) +
  stat_smooth(method = lm)
m2
# more fun, let's map aesthetics for size, color, and shape
m2 <- m + geom_point(mapping = aes(size = livingArea, color = bedrooms, shape = waterfront)) +
  stat_smooth(method = lm)
m2
Now let’s start playing with scales. First, here are the possible shapes in R.
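One way to draw such a shape chart yourself is sketched below. The grid layout (`col`, `row`) is an arbitrary choice made for this example; `scale_shape_identity()` tells ggplot to use the numbers 0 to 25 directly as shape codes.

```r
library(ggplot2)

shapes <- data.frame(
  shape = 0:25,
  col   = (0:25) %% 7,      # column position in the grid
  row   = -((0:25) %/% 7)   # row position, counting downward
)

shape_chart <- ggplot(shapes, aes(x = col, y = row)) +
  geom_point(aes(shape = shape), size = 5, fill = "red") +   # fill only shows on shapes 21-25
  geom_text(aes(label = shape), nudge_y = -0.3, size = 3) +  # print each code under its shape
  scale_shape_identity() +                                   # use the numbers as shape codes
  theme_void()
shape_chart
```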
# let's make bedrooms into a factor so that ggplot will treat it as discrete values
SaratogaHouses$bedrooms <- as.factor(SaratogaHouses$bedrooms)
# starting over
m <- ggplot(data = SaratogaHouses, aes(x = age, y = price))
# set the color scale (color, not fill, is the aesthetic mapped above)
m2 <- m + geom_point(mapping = aes(size = livingArea, color = bedrooms, shape = waterfront)) +
  stat_smooth(method = lm) + scale_color_discrete()
# let's choose our shapes too
m2 <- m + geom_point(mapping = aes(size = livingArea, color = bedrooms, shape = waterfront)) +
  stat_smooth(method = lm) +
  scale_shape_manual(values = c(16, 18)) + scale_color_discrete()
m2
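Manual scales can also take a named vector, which ties each value to a specific factor level instead of relying on level order. The sketch below uses a made-up data frame (`toy` and its "no"/"yes" levels are assumptions; check `levels(SaratogaHouses$waterfront)` for the real ones).

```r
library(ggplot2)

toy <- data.frame(                     # hypothetical stand-in for the housing data
  age        = c(5, 20, 40, 80, 120, 150),
  price      = c(400, 350, 300, 250, 200, 150),
  waterfront = factor(c("no", "yes", "no", "no", "yes", "no"))
)

m_manual <- ggplot(toy, aes(x = age, y = price)) +
  geom_point(aes(color = waterfront, shape = waterfront), size = 3) +
  scale_color_manual(values = c(no = "grey40", yes = "steelblue")) +  # names match factor levels
  scale_shape_manual(values = c(no = 16, yes = 18))
m_manual

unique(layer_data(m_manual)$colour)   # the two colours chosen above
```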
We can annotate our plots thusly:
m3 <- m2 + annotate("text",x=25,y=4e+05,label="New Houses")
m3 <- m3 + annotate("text",x=150,y=1e+05,label="Old Houses")
m3 <- m3 + geom_abline(slope=0,intercept=2.5e+5)
m3 <- m3 + annotate("text",x=25,y=3e+05,label="Our budget")
m3
The first parameter of `annotate()` can take any geom, so we could add boxes, line segments, etc.
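As a sketch of annotations other than text, the example below adds a shaded box and an arrow with `annotate("rect", ...)` and `annotate("segment", ...)`. It uses `mtcars` as a stand-in for the housing data, and the coordinates are arbitrary values chosen to fit that data set.

```r
library(ggplot2)

a <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # mtcars stands in for the housing data
  geom_point() +
  annotate("rect", xmin = 1.5, xmax = 2.5, ymin = 25, ymax = 35,
           alpha = 0.2, fill = "blue") +                # shaded box
  annotate("segment", x = 4, xend = 5.2, y = 25, yend = 16,
           arrow = arrow(length = unit(0.2, "cm"))) +   # arrow toward the heavy cars
  annotate("text", x = 4, y = 26, label = "Heavy cars")
a
```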
We can limit the x and y axes in a very similar way to the base package
m4 <- m3 + xlim(0,100)
m4
## Warning: Removed 83 rows containing non-finite values (stat_smooth).
## Warning: Removed 83 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).
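The warnings above are a reminder that `xlim()` discards data outside the limits before any stat is computed, so the regression line is refit on the remaining rows. If the goal is only to zoom, `coord_cartesian()` is an alternative. The sketch below contrasts the two on `mtcars`, used here as a stand-in for the housing data.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # mtcars stands in for the housing data
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

clipped <- base + xlim(1.5, 3.5)                        # drops cars outside the range, then fits
zoomed  <- base + coord_cartesian(xlim = c(1.5, 3.5))   # fits on all cars, then zooms

# x-range covered by the fitted line in each version (layer 2 is the smooth)
range(layer_data(zoomed, 2)$x)
range(layer_data(clipped, 2)$x)
```

Building `clipped` is expected to produce the same kind of "Removed rows" warnings as in the lesson output.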
And we can change the names of the tick labels and axes, and add a title
# first let's do the actual regression to get some useful info to put in our title
ageprice <- lm(price ~ age, data = SaratogaHouses)
adj.r.squared <- signif(summary(ageprice)$adj.r.squared, 1)
pvalue <- signif(summary(ageprice)$coefficients[2, 4], 1)
m4 <- m2 + scale_y_continuous(breaks = seq(0, 800000, by = 100000),
    labels = c("0", "100K", "200K", "300K", "400K", "500K", "600K", "700K", "800K")) +
  xlab("Age in Years") +
  ggtitle(paste("Adjusted R^2 =", adj.r.squared, "and p =", pvalue, sep = " "))
m4
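Typing out every tick label by hand works, but `labels` also accepts a function that is applied to the breaks. The sketch below uses a hypothetical helper, `dollars_to_k()`, and a made-up data frame. Note that this helper labels zero as "0K" where the hand-written vector uses "0".

```r
library(ggplot2)

# Hypothetical helper: turn dollars into "K" labels
dollars_to_k <- function(x) paste0(x / 1000, "K")

toy <- data.frame(age = c(5, 40, 120), price = c(410000, 260000, 150000))

k <- ggplot(toy, aes(x = age, y = price)) +
  geom_point() +
  scale_y_continuous(breaks = seq(0, 800000, by = 100000),
                     labels = dollars_to_k,          # called on whatever breaks are used
                     limits = c(0, 800000)) +
  xlab("Age in Years")
k

dollars_to_k(c(0, 250000))
```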
And finally we can use a logarithmic scale
m5 <- m4 + scale_x_log10()
m5
## Warning: Removed 83 rows containing non-finite values (stat_smooth).
It can be downright daunting learning ggplot, and difficult to show you everything in one class. That is why I recommend you get a feel for it by playing with a sort of ggplot GUI developed by the guys who make the mosaic package.
Do this:
library(mosaic)
mPlot(SAT, system = "ggplot")
This will load the SAT dataset into mosaic’s ggplot “interactor”.
There are many other data sets to explore. Use `data()` to see a list of them. Or try `data(package = "mosaicData")`
Then consider:
mPlot(Galton, system = "ggplot")
mPlot(Births78, system = "ggplot")
mPlot(mtcars,system = "ggplot")
Or plug your own data frame in there.
Births78$day_of_week <- wday(ymd(Births78$date), label = TRUE)
ggplot(data = Births78, aes(x = dayofyear, y = births, color = day_of_week)) +
  geom_point() +
  theme(legend.position = "right") +
  labs(title = "") +
  geom_smooth(method = "loess", alpha = .1)